NSF PAR Search | NSF Public Access Repository

Mosaic Pages: Big TLB Reach With Small Pages

https://doi.org/10.1109/MM.2024.3409181

Han, Jaehyun; Gosakan, Krishnan; Kuszmaul, William; Mubarek, Ibrahim N.; Mukherjee, Nirjhar; Sriram, Karthik; Tagliavini, Guido; West, Evan; Bender, Michael A.; Bhattacharjee, Abhishek; et al (July 2024, IEEE Micro)

Full Text Available
Mosaic Pages: Big TLB Reach with Small Pages

https://doi.org/10.1145/3582016.3582021

Gosakan, Krishnan; Han, Jaehyun; Kuszmaul, William; Mubarek, Ibrahim N.; Mukherjee, Nirjhar; Sriram, Karthik; Tagliavini, Guido; West, Evan; Bender, Michael A.; Bhattacharjee, Abhishek; et al (March 2023, ASPLOS)

Full Text Available
MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

https://doi.org/10.1145/3173162.3173169

Ausavarungnirun, Rachata; Miller, Vance; Landgraf, Joshua; Ghose, Saugata; Gandhi, Jayneel; Jog, Adwait; Rossbach, Christopher J.; Mutlu, Onur (March 2018, Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18))

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead.
Full Text Available
The gem5 Simulator: Version 20.0+: A new era for the open-source computer architecture simulator

Lowe-Power, Jason; Ahmad, Abdul Mutaal; Akram, Ayaz; Alian, Mohammad; Amslinger, Rico; Andreozzi, Matteo; Armejach, Adrià; Asmussen, Nils; Bharadwaj, Srikant; Black, Gabe; et al (July 2020, arXiv.org)

The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7500 commits to the codebase from over 250 unique contributors, which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give an overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research.
Full Text Available